1) Multiple features of one input

1) Polynomial regression

Instead of using simple linear regression model, you could consider even higher order polynomials.

Screenshot taken from Coursera 1:00

So here's our generic polynomial regression model, where we take our observation, $y_i$, and model it as this polynomial in terms of, for example, square feet of our house, which is just some input x. And then we assume that there's some error, epsilon i. So that's the error associated with the ith observation.

And what we see is that in this model, in contrast to our simple linear regression model, we have all these powers of x that are now appearing in the model. And what we can do, is we can treat these different powers of x as features.

feature 1 = 1 (constant)
feature 2 = x
feature 3 = $x^2$

And associated with each one of these features in our model is a parameter. So we have some p+1 parameters, $w_0$, which is just the intercept term. All the way up to wp, the coefficient associated with pth power of our input.

Screenshot taken from Coursera 3:37

2) Modeling seasonality

Let's just go through one application in detrending time series.

Screenshot taken from Coursera 2:00

Let's write down the model for this application

Screenshot taken from Coursera 7:00

Let's apply this model to our housing data

Screenshot taken from Coursera 8:00

3) Where we see seasonality

So we keep talking about this housing application. And it's really a nice intuitive way to describe the different methods in regression that we're gonna be talking about.

But regression, like we've mentioned, is much, much, much more widely applicable. And this notation of capturing seasonality also appears in lots of applications beyond just housing. And so I wanted to spend a little time talking about other places where we see this seasonality or seasonal effects coming into play.

Screenshot taken from Coursera 1:40

4) Regression with general features of 1 input

So we talked about polynomial regression, where you have different powers of your input. And we also talked about seasonality where you have these sine and cosine bases. But we can really think about any function of our single input. So let's write down our model a little bit more generically in terms of some set of features.

And I'm gonna denote each one of my features with this function h.

$h_0$: first feature
$h_1$: second feature
$h_D$: last feature So we can more compactly represent this model using the sigma notation that we introduced previously.

Screenshot taken from Coursera 1:33

So, going back to our regression flow chart or block diagram here, we kinda swept something under the rug before. We never really highlighted this blue feature extraction box, and we just said the output of it was x. Really, now that we've learned a little bit more about regression and this notion of features, really the output of this feature extraction is not x but h(x). It's our features of our input x. So x is really the input to our feature extractor, and the output is some set of functions of x. So for the remainder of this course, we're gonna assume that the output of this feature extraction box is h of x.

Screenshot taken from Coursera 2:22

2) Incorporating multiple inputs

1) Motivating the use of multiple inputs

Well, up to this point in this course we've assumed that we just have a single input, like square feet, and we're trying to use that to predict the output, like the value of the house. But often we have lots of different attributes, different inputs, that we might wanna use for this task of predicting some output.

So what we're saying is that we need to add more input to our regression model.

Screenshot taken from Coursera 1:45

Like we mentioned at the beginning of this module, there are lots and lots of applications that use this idea of multiple regression. And I'm just gonna talk through one that I think is really cool, and that's reading your mind.

So, you get some brain scan, maybe it's MEG or FMRI, but the point is that there's some recording of your brain activity and in this case it's going to be in response to seeing some word or some image or something like this and what we get out is an image of your brain.

And this brain is divided into what are called voxels, you can think of them as pixels but they're these little volumetric regions where we have intensities associated with each one of those different voxels and we're going to use these intensities as our multiple inputs. So these are going to be the different features that are going into our model to predict whether you felt very sad or very happy, in response to whatever you were shown. So we're going to have some data where people were shown an image and they respond on the scale about whether they were very sad or very happy and there is some scale in which they respond to this question. And so what we have are these pairs of brain images and happiness responses.

So the happiness responses are output, the brain image is our input to this regression model, and again, just to emphasize, the place where this multiple regression comes in, is the fact that we don't have just one input. We have a whole collection of inputs, the whole set of different voxels in your brain. The intensities associated with those are what we're using to try and find the relationship between this brain image and the response of happiness. Okay so to be clear, the way in which we read your mind is we can think about taking your brain scan and predicting whether you feel happy or sad.

Screenshot taken from Coursera 3:51

2) Defining notation

The reason it's important is because unfortunately, there's no standard convention in the machine learning community for how to denote the things that we're gonna be going through in this module.

x: bold-faced notation of x to represent the fact that this is a vector
x[j]: our notational conventions are gonna be, bold x, square brackets of j is gonna take this vector x. And just like you would in Python, it's gonna grab out that jth element. So the result of that is gonna be the jth input to our model, and that's just a scalar, like number of square feet for the house.
$h_j$(x): For our features, we're gonna use these functions h, it might be a function of multiple inputs. So in general, our features are a function of this bold x, this entire d-dimensional vector.
x$_i$: bold x sub i. That's our ith input, so the input associated with our ith observation, the ith house in our data set for our housing application
x$_i$[j]: Bold x, sub i, square brackets j, means we're gonna look at this d dimensional input vector for this ith house. So, the number of square feet, bedrooms, bathrooms, etc., for this ith house. And we're gonna grab out the jth input, which might be number of bathrooms. That is just a scalar.

Screenshot taken from Coursera 3:17

3) Regression with features with multiple inputs

When we have these multiple inputs, the simplest models we can think of is just assuming that our ith observation is just a function directly of the inputs themselves, not other functions of the inputs. Just taking number of square feet, number of bathrooms, number of bedrooms and plugging those directly entirely into out linear model. And again, we still have this noise term, epsilon i.

Just to be very explicit about the features associated with this simple hyperplane model

feature 1 = 1 (constant)
feature 2 = x[1], first input, for example: square ft

And this goes on and on till we get to our last input, which is the little d+1 feature

Screenshot taken from Coursera 1:20

For generically, instead of just a simple hyperplane just like we talked about, instead of a single line, you can fit a polynomial. Well instead of just a hyperplane, we can fit some D-dimensional curve. This is capital D-dimensional curve. Because we're gonna assume that there's some capital D different features of this of these multiple inputs.

For example:

feature 1 = zero feature is just that one constant term that just shifts up and down wherethis curve leads in the space.
feature 2 = first input like in the hyperplane example which is sq. ft

And then we get all the way up to our capital D feature which is some function of any of our inputs to our regression model. So this is our generic multiple regression model with multiple features. And again we can take this big sum and represent it with this capita sigma notation.

Screenshot taken from Coursera 3:21

More on Notation

N: the number of observation we have
d: the number of inputs
D: the number of features we take of those inputs

Screenshot taken from Coursera 3:48

4) Interpreting the multiple regression fit

So just like we did in simple linear regression, we can think about interpreting the coefficients of our fitted function.

let's remember what we did for the simple linear regression model, we have $\hat w_0$, and $\hat w_1$ as our estimated parameters of this model

$\hat w_0$: an interpretation in terms of the intercept of this line. So, the value of a house with no square feet, let's just ignore that one, that one's gonna have the same interpretation throughout all of these models that we're talking about.
$\hat w_1$: the predicted change in our output per unit change in our input. So, predicted change in the house value for one square foot change in the house that we're looking at.

Screenshot taken from Coursera 1:00

How do we interpret the coefficients if we have two inputs?

Well in this case, what we can do is, we can think about just fixing the other input. So let's fix our first input, the number of square feet of the house. And then think about what's the effect of the number of bathrooms. And if we fix the number of square feet, what that's equivalent to is we're taking a slice through this space for however many square feet we're looking at. There's this hyperplane living here in 3D space we take a slice, and once you slice that hyper plane, you just get a line.

Screenshot taken from Coursera 1:45

So, that's this line here, for fixed number of square feet.

And, then when we think about the interpretation of $\hat w_2$, the coefficient on the number of bathrooms in the house. What that's saying is, well it's the slope of this line, so that means for a single unit change in our input, which here is number bathroom. So, adding one bathroom, what's the predicted change in the output (value of the house)?

Just to be clear, we're taking a house of a fixed size and we're adding a bathroom, maybe we're changing some room to be a bathroom or knocking something down, putting in a bathroom. And we're saying, how much does that increase the value of my house to make that change? Not changing the size of the house, that's gonna stay the same, but adding a bathroom to the number of counts of bathrooms in this house.

That is the interpretation of $\hat w_2$.

Screenshot taken from Coursera 2:50

So, when we go to a more general model, with some little d different inputs. In this case, when we think about interpreting any given coefficient of our fitted model, we're gonna fix all the other inputs in the model, and we're just gonna look at that one input that we can vary.

For example, number of bedrooms. And so when we're looking at the coefficient associated with number of bedrooms, the interpretation is what's the predicted change in the value of my house if I change just adding one bedroom to the house, with everything else fixed.

When you think about the interpretation of the coefficient you have to be very careful that the interpretation is in the context of what you have in the model.

So let me give you an example where if I think about the coefficient on the number of bedrooms. If I have square footage of the house in my model, well what does it mean that I increase the number of bedrooms? What I've had to do is make some other room smaller. Let's say I am always working with manipulating the existing bedrooms. So the house currently has three bedrooms. And I'm going to modify it to have four bedrooms. Well what that probably means for the house, is that that house has four smaller bedrooms than a house of the same size with three bedrooms. So for a lot of people, that might not actually add value, that might decrease the value. To have four small bedrooms might have less value than three nicer, more grand bedrooms. Depends on the person or family of course but you can end up with a negative coefficient on the influence of the number of bedrooms to the value of the house. Because what it's saying is that as you keep adding, imagine adding all the way up to ten bedrooms, of course the impact on the value of the house can go down, if you're not able to increase the size of the house, if you're fixing the size of the house.

But what if you remove square feet from the model, and you refit this model. Well, let's think about the coefficient on number of bedrooms. In this case, it's most likely going to be some positive number because number of bedrooms is indicative of the size of the house. We can use that as a proxy when we don't have square feet in the model itself. So if I think of reading some listing with a house with ten bedrooms, probably I don't imagine them to be these tiny tiny little bedrooms, I just imagine a really big house. So I imagine the value of that house would be quite large compared to a house with one or two bedrooms. So what I'm saying is that thinking about the coefficient, this maybe positive, big positive number, when I don't have square feet versus the coefficient associated with number of bedrooms if I include square feet being negative. I can have totally different interpretations unless I think of the interpretation as I should in the context of what's included in the model.

It's a really common mistake to just look at the coefficient and say, oh, as I increase the number of bedrooms the value goes down. No, think about the coefficient and the context of what you've put into the model.

Screenshot taken from Coursera 3:46

Let's try to interpret the coefficient in a polynomial regression model

And here I'm just assuming that we have one input, like number of square feet, and our features are different powers of this input. So what would be the interpretation of the jth coefficient? Well I have to think about fixing everything else in the model, but if everything else is a power of this one input that I'm changing, I can't do that.

So here, unfortunately, we can't think about interpreting the coefficients in the way that we did in the simple hyperplane example of the previous slide. So, just a little word of warning that if you're ever in a situation where you can't hold all your other features fixed, then you can't think about interpreting the coefficient.

Screenshot taken from Coursera 7:00

3) Setting the stage for computing the least squares fit

1) Review of matrix algebra

http://tutorial.math.lamar.edu/Classes/DE/LA_Matrix.aspx

2) Rewriting the single observation model in vector notation

In this part of this module, we're talking about the algorithms associated with multiple regression. And in particular, like we did in the simple linear regression case, we're gonna talk about two different algorithms.

One is just a closed form solution and the other is gradient descent.

And there are gonna be multiple steps that we have to take to build up to deriving these algorithms.

Step 1: Rewrite the regression model in matrix notation

first vector: our parameter, regression coefficients, w = $[w_0, w_1, ..., w_D]$
second vector: Which are all of our different features of the input, h($x_i$) = $[h_0 (x_i) , h_1 (x_i), ..., h_D (x_i)]$
error term: $\epsilon _i$

So, we have: $$y_i = w_0h_0 (x_i) + w_1h_1 (x_i) + ... + w_Dh_D (x_i) + \epsilon_i $$ $$y_i = w^Th(x_i) + \epsilon_i $$

The result of $w^Th(x_i)$, w transpose times h is just a scalar.

So equivalent to this. Multiplying w transpose times h, well you can also think about multiplying h transpose of x times w

Screenshot taken from Coursera 6:00

3) Rewriting the model for all observations in matrix notation

It's gonna be helpful in our deriving these algorithms to write the equation for all the observations stacked up together.

Let's start by just stacking up all our observations, taking all these pink squares that I showed on this previous slide here and putting them into one big vector. Where this vector is:

$y_1$: my first house sale
$y_2$: my second house sale
$y_N$: that's my capital Nth house sale.

I can write the model for all these observations in this big matrix vector notation, where I have the w vector contains $[w_0, w_1, ..., w_D]$. And I know that each one of my observations has an error associated with it, vector epsilon $[\epsilon_1, \epsilon_2, ..., \epsilon_N]$

Let's describe this big green box, this matrix. Well, how do I form my first observation? I multiply the weights by the features of that first observation, so the features of that first observation are $[h_0(x_1), h_1(x_1), ..., h_D(x_1)]$. If we just look at the first row of the matrix, that is exactly H transpose, $h^T(x_1)$. So from $h_0(x_1)$ all the way up to $h_0(x_N)$, and we can call this matrix capital H.

Based on this, we can write our entire regression model for capital N observation: $$y = Hw + \epsilon$$

Screenshot taken from Coursera 4:25

4) Computing the cost of a D-dimensional curve

To fit a function to data, we need to be able to talk about what the cost is of any given fit to the data and then search our algorithm is going to be to search over all these different fits to minimize this cost. So let's again talk about the cost of the fit for multiple regression. So in terms of our flow chart for regression, we're talking about the quality metric term.

Screenshot taken from Coursera 0:28

Let's recall the residual sum of squares

Our residual sum of squares was looking at taking our actual observation $y_i$, subtracting off the fitted function at that point $x_i$, taking the square, summing over all observations in our training dataset. And we have $\hat y_i$ as our predicted value of observation $y_i$, if we use a line defined by $w_0$ and $w_1$.

Screenshot taken from Coursera 1:30

Now we can talk about residual sum of squares in the case of multiple regression

The residual is the difference between our actual observation and our predicted value. So we're gonna plug in our predicted value there. In this case, we take the weight vector, w and then we multiply our features for that observation by that factor. So the predicted value simple is: $h^T(x_i)w$. So, just to use the notation that we've been using before, this here is $\hat y_i$, assuming we use w, as the parameters of this fit.

Screenshot taken from Coursera 3:33

Let's take this residual sum of squares and write it in matrix notation

\begin{eqnarray} RSS(w) &=& \sum_{i=1}^n(y_i - h(x_i)^Tw)^2 \\ &=& (y - Hw)^T(y-Hw) \end{eqnarray}

Break down the math (part 1)

The first part, we are looking at this term: $(y-Hw)$, we're going to think about stacking up all of our predicted values. So we have a vector of $[\hat y_1, \hat y_2, ..., \hat y_N]$ (N predicted observations), the parameter vector w, and a big matrix H which represent the matrix of all of our features per observation stacked up together.

Based on this, we know that $\hat y$, the vector of all of our end predicted observations is equal to : $\hat y = Hw$

So, let's take this term, and plug it in: $y - Hw$. So this is equivalent of looking at our vector of actual observed values and subtracting our vector of predicted values. So we take all our house sales prices, and we look at all the predicted house prices, given a set of parameters, w, and we subtract them. The result of this is the vector of residuals, because the result of this is the difference between my first house sale and my predicted house sale.

$$y - Hw = y - \hat y= \begin{bmatrix} residual_1 \\ residual_2 \\ ... \\ residual_N \end{bmatrix}$$

Screenshot taken from Coursera 7:00

Break down the math (part 2) Let's move on to the second part of the reasoning behind this equation here. when I multiply this vector, we now know that the result of this is a vector of residuals by, it's transposed. \begin{eqnarray} (y - Hw)^T(y-Hw) &=& (residual_1^2 + residual_2^2 + ... + residual_N^2) \\ &=& \sum_{i=1}^n residual_i^2 \end{eqnarray}

And this, by definition, that is exactly what residual sum of squares is. This is residual sum of squares using these w parameters. I'm summing up the square of my residuals from the predictions made using w.

Screenshot taken from Coursera 7:40

4) Computing the least squares D-dimensional curve

1) Computing the gradient of RSS

So now we're onto the final important step of the derivation, which is taking the gradient. \begin{eqnarray} \bigtriangledown RSS(w) &=& \bigtriangledown [(y - Hw)^T (y - Hw)] \\ &=& -2H^T(y-Hw) \end{eqnarray}

I'm going to walk through an analogy to 1D case, and we'll see some patterns, and maybe you'll believe that that's the result of the matrix case.

So, in particular, if we think about taking the derivative with respect to w of a function that is $(y - hw)(y - hw)$ where these things are all scalars. So this is the 1D analog to this equation here, where the gradient is just this derivative of this one parameter w. What's the derivative of this?

Screenshot taken from Coursera 2:30

2) Approach 1: closed-form solution

So the first approach that we're gonna talk about is just a closed form solution where we take our gradient, and simply set it equal to zero, and solve for w.

Here's my gradient of our residual sum of squares, and I've set it equal to zero. And now let's solve for w.

We have a whole collection of different parameters, $w_0$ all the way up to $w_D$. These are the things multiplying all the features we're using in our multiple regression model. And in one line I've written the solution to the fit. I've fit all of $w_0$, $w_1$, all the way up to $w_D$ just by doing this matrix multiply.

Screenshot taken from Coursera 3:26

3) Discussing the closed-form solution

Let's think a little bit about this form of this closed form solution $$\hat w = (H^TH)^{-1}H^Ty$$

H is the green matrix, which has:

N rows: number of observation
D columns: number of features. So $H^T$ is a (D x N) matrix, and H is a (N x D) matrix, and if we multiply these 2 matrices together, the result is a square matrix, a (D by D) matrix, or (number of features by number of features). And then we need to take the inverse of this matrix. In most cases, this matrix will be invertible. If the number of observations we have is larger than the number of features, N > D. Okay that means that the this matrix is full rank and then we can take its inverse.

The complexity of this matrix is $O(D^3)$, and what that means is that the number of operations we have to do to invert this matrix scales cubically with the number of features in our model. So if you have lots and lots and lots of features this can be really computationally intensive to do.

Screenshot taken from Coursera 3:52

4) Approach 2: gradient descent

So the simple alternative approach is gradient descent.

Remember that the gradient descent algorithm, we just initialize our vector of parameter somewhere and take these gradient steps. And eventually, we will converge to the optimum of this problem.

So what does this algorithm look like for multiple regression?

It looks very similar to our simple linear regression, where we say while not converged, we're gonna take our w parameters. And we're gonna update them by subtracting sum step size atta times the gradient of our residual sum of squares, at our previous set of parameters wt. $$w^{(t+1)} = w^{(t)} - \eta\bigtriangledown RSS(w^{(t)})$$

What is our gradient of the residual sum of squares? $$\bigtriangledown RSS(w^{(t)}) = -2H^T(y - Hw)$$

So we can update our gradient algorithm:

$$w^{(t+1)} = w^{(t)} + 2\eta H^T(y - Hw^{(t)})$$

And h times w at iteration t, $Hw^{(t)}$, is my predicted set of observations, the whole vector of them. Assuming that I use w at iteration t performing those predictions. Okay, so what this version of the algorithm is doing is it's taking our entire w vector, all the regression coefficients in our model, and updating them all at once using this matrix notation shown here.

Screenshot taken from Coursera 1:30

5) Feature-by-feature update

Let's think about the update just to a single feature

First, we can go back to this form of our residual sum of squares and drive it directly. \begin{eqnarray} RSS(w) &=& \sum_{i=1}^N(y_i - h(x_i)^Tw)^2 \\ &=& \sum_{i=1}^N(y_i - w_0h_0(x_i) - w_1h_1(x_i) - ... - w_Dh_D(x_i))^2 \end{eqnarray}

What's the gradient? It's just a vector of a bunch of partials with respect to $w_0$ then $w_1$ ,all the way up to $w_D$. Let's just look at one element, which is the partial of this residual sum of squares with respect to $w_j$, and that will give us our update for this j entry. So, the partial derivative ofthis function with respect to $w_j$.

We keep this sum on the outside, and then we take the derivative of this function with respect to $w_j$, and when we are taking the derivative, we have $(-h_j(x_j))$ as our coefficient associated with $w_j$

$$\sum_{i=1}^N 2(y_i - w_0h_0(x_i) - w_1h_1(x_i) - ... - w_Dh_D(x_i))(-h_j(x_j))$$$$= -2\sum_{i=1}^N h_j(x_j)(y_i - h(x_i)^Tw)$$

So I'm gonna plug this into the update to the Jth feature weight, where I take the previous weight of that feature and subtract off a step size times the partial.

$$w_j^{(t+1)} \leftarrow w_j^{(t)} - \eta(-2\sum_{i=1}^N h_j(x_j)(y_i - h(x_i)^Tw^{(t)}))$$

Again, we can add a little bit of interpretation here where this part here, this is, I'm taking the features, all the features for my Ith observation, multiplying by the entire w vector. So, this is my predicted value of my Ith observation using $w^{(t)}$.

$$h(x_i)^Tw^{(t)} = \hat y_i(w^{(t)})$$

Screenshot taken from Coursera 4:40

Now, let's think about interpreting this update here. In particular, let's assume that $w_j$ corresponds to the coefficient associated with number of baths. So, let's say the Jth features, let's just assume is number of bathrooms.

If I'm underestimating the impact of the number of baths on my predicted value of the house. So, what that means is, if I look along this bathrooms direction, and I look at the slope of this hyperplane, I'm saying that it's not steep enough so increasing bathrooms actually has more impact on the value of the house than my currently estimated model thinks it does. In this cases, $\hat w_j^{(t)}$ is too small.

Then what I'm gonna have is that my observations in general are larger than my predicted observations, so this term here $(y_i - \hat y_i(w^{(t)}))$ on average, will be positive.

The impact of that is this whole term here, what we're adding to $w_j$, will be positive. So, we're gonna increase $w_j$.

Screenshot taken from Coursera 8:30

6) Algorithmic summary of gradient descent approach

Let's actually summarize the entire gradient descent algorithm for multiple regression. Stepping through every step of this algorithm.

First, we're just gonna initialize all of our different parameters to be zero at the first iteration. Let's just assume that they're all initialized to zero and we're gonna start our iteration counter at one.

init $w^{(1)} = 0$, t = 1

Then we said while the gradient of our residual sum of squares, the magnitude of that gradient is sufficiently large. Larger than some tolerance epsilon then we were going to keep going.

while $||\bigtriangledown RSS(w^{(t)})|| > \epsilon$

So what are the elements of the gradient of residual sum of squares? Well, it's a vector where every element is the partial derivative with respect to some parameter. I'm gonna refer to that as partial of j, okay? So, when I take the magnitudeof the vector I multiply the vector times its transposed, take the square root. That's equivalent to saying I'm gonna sum up the partial derivative with respect to the first feature squared plus all the way up to the partial derivative of the capital Dth feature, and the indexing start at 0.

$$||\bigtriangledown RSS(w^{(t)})|| = \sqrt{partial[0]^2 + ... + partial[D]^2}$$

So if the result of this is greater than epsilon then I'm gonna continue my gradient descent iterates. If it's less than epsilon then I'm gonna stop.

But let's talk about what the actual iterates are. Well, for every feature in my multiple regression model, first thing I'm going to do is I'm going to calculate this partial derivative, with respect to the jth feature. And I'm going to store that, because that's going to be useful in both taking the gradient step as well as monitoring convergence as I wrote here. So this jth partial, we derived it on the previous slide. It has this formed and then my gradient step takes that jth coefficient at timed t and subtracts my step size times that partial derivative. And then once I cycle through all the features in my model, then I'm gonna increment this t counter. I'm gonna check whether I've achieved convergence or not. If not I'm gonna loop through, and I'm gonna do this until this condition, this magnitude of my gradient is less than epsilon.

Screenshot taken from Coursera 3:20

5) Summarizing multiple regression

Screenshot taken from Coursera 0:45